GVPT Maths Camp

Data Visualisation II

Learning objectives for today

  1. Introduction to R

  2. Improve your first plot

  3. Test your hypotheses using informative data visualizations

Skipping to the end

How did we do this?

library(ggplot2)
library(ggthemes)  # provides scale_color_colorblind()

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) + 
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  ) + 
  scale_color_colorblind()

EXERCISE

How many rows are in mpg? How many columns?

nrow(mpg)
ncol(mpg)


What does the drv variable describe?

?mpg

EXERCISE

Make a scatterplot of hwy vs cyl.


What happens if you make a scatterplot of class vs drv? Why is the plot not useful?


Which geom might be a better choice?

EXERCISE

Why does the following give an error and how would you fix it?

ggplot(data = mpg) + 
  geom_point()


Add the following caption to the plot you made in the previous exercise: “Data come from the ggplot2 package.” HINT: Look at the documentation for labs().

Flexible visualization

You can use visual elements to communicate your findings in engaging ways.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class == "2seater"))
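The expression class == "2seater" is evaluated for every row, so the legend shows TRUE and FALSE. One possible way to get a more readable legend is to store the comparison as a labelled column first. This is only a sketch: the column name two_seater and its labels are made up for illustration.

```r
library(ggplot2)

# Hypothetical helper column: a readable label instead of TRUE/FALSE
mpg_flagged <- mpg
mpg_flagged$two_seater <- ifelse(
  mpg_flagged$class == "2seater", "Two-seater", "Other classes"
)

p_flag <- ggplot(data = mpg_flagged) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = two_seater)) +
  labs(colour = NULL)  # drop the legend title

p_flag
```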

Changing the look of your plots

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), colour = "red")
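Colour is not the only property you can set by hand. The sketch below assumes the same plot and sets a few more fixed values outside aes(); the particular values are arbitrary choices for illustration.

```r
library(ggplot2)

p_fixed <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    colour = "red",  # one colour for every point
    size = 3,        # point size
    alpha = 0.4,     # transparency: 0 is invisible, 1 is opaque
    shape = 17       # filled triangle
  )

p_fixed
```

Semi-transparent points can help when many observations share the same coordinates, as they do in mpg.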

EXERCISE

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

EXERCISE

Which variables in mpg are categorical? Which variables are continuous?


Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?


What happens if you map the same variable to multiple aesthetics?

Let’s clean our graph up

Less is more when it comes to data visualization.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  theme_minimal() + 
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  ) + 
  scale_color_colorblind()

EXERCISE

Head over to the ggplot2 documentation and find your favorite preset theme.
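A convenient way to compare preset themes is to store the plot in an object and add a different theme each time. This is a minimal sketch: the name base_plot is just an illustration, and the three themes shown are a small selection of those that ship with ggplot2.

```r
library(ggplot2)

# Store the plot once so that only the theme changes between versions
base_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

base_plot + theme_bw()
base_plot + theme_classic()
base_plot + theme_light()
```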

Creating your own theme

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) + 
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  ) + 
  scale_color_colorblind()
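If you want the same look on several plots, one option is to save the theme() call as an object and add it wherever you need it. The sketch below reuses the settings from the slide; the name theme_camp is hypothetical and not part of ggplot2.

```r
library(ggplot2)

# Hypothetical name: theme_camp is our own object, not a ggplot2 function
theme_camp <- theme(
  legend.position = "bottom",
  panel.grid = element_blank(),
  panel.background = element_blank(),
  plot.title.position = "plot",
  plot.title = element_text(face = "bold")
)

# The saved theme is added like any other layer
p_camp <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) +
  theme_camp

p_camp
```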

The before shot

EXERCISE

  1. Customize the last plot you made using the theme() function.

Visualizing distributions

ggplot(mpg, aes(x = drv)) + 
  geom_bar()

Visualizing distributions

Reorder in relation to frequency

# fct_infreq() comes from the forcats package (loaded with the tidyverse)
ggplot(mpg, aes(x = fct_infreq(drv))) +
  geom_bar()
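To see what fct_infreq() does to the variable, it can help to look at the factor levels directly. This is a small sketch, assuming the forcats package is installed; the object names are made up for illustration. fct_rev() is shown as one way to flip the order.

```r
library(ggplot2)
library(forcats)

# How often does each drive type occur?
drv_counts <- table(mpg$drv)
drv_counts

# Levels reordered from most to least frequent
drv_by_freq <- fct_infreq(mpg$drv)
levels(drv_by_freq)

# Reverse the order to go from least to most frequent
levels(fct_rev(drv_by_freq))
```

The bars are drawn in the order of the factor levels, which is why reordering the levels reorders the plot.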

Visualizing numeric variables

ggplot(mpg, aes(x = hwy)) +
  geom_histogram()
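When you do not specify the bins, geom_histogram() falls back on 30 bins and prints a message asking you to pick a better value. A minimal sketch of setting the bin width yourself follows; the width of 2 miles per gallon is an arbitrary choice, so try a few values and see how the shape changes.

```r
library(ggplot2)

# Each bar now covers 2 miles per gallon
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

p_hist
```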

Visualizing numeric variables

ggplot(mpg, aes(x = hwy)) +
  geom_density()

Visualizing numeric variables

ggplot(mpg, aes(x = hwy, colour = drv)) +
  geom_density()

Visualizing numeric variables

ggplot(mpg, aes(x = hwy, colour = drv, fill = drv)) +
  geom_density(alpha = 0.5)

Summary

In this session, you:

  1. Set up your data science tools

  2. Plotted complex data in an engaging way

  3. Discovered interesting relationships in the data

  4. Connected these relationships or trends to your expectations (or hypotheses about the data)

HOMEWORK

In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:

  1. Is relevant to your research interests,

  2. Contains continuous and discrete variables.
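One quick way to check the second requirement is to list the type of every column once your data are loaded into R. The sketch below uses mpg as a stand-in for your own data set; the object names candidate and var_types are made up for illustration.

```r
library(ggplot2)

# Stand-in for your own data: replace mpg with the data set you found
candidate <- as.data.frame(mpg)

# First class of each column, e.g. "numeric", "integer" or "character"
var_types <- sapply(candidate, function(column) class(column)[1])
var_types

# Group the column names by type
split(names(var_types), var_types)
```

Numeric columns are usually continuous and character or factor columns are usually discrete, but treat this as a rough guide: integer columns with only a few distinct values, such as cyl, often behave like discrete variables.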